API: Return BoolArray for string ops when backed by StringArray #30239

Merged

Conversation

TomAugspurger
Contributor

ref #29556

@@ -1825,7 +1825,7 @@ def test_extractall_same_as_extract_subject_index(self):

     def test_empty_str_methods(self):
         empty_str = empty = Series(dtype=object)
-        empty_int = Series(dtype=int)
+        empty_int = Series(dtype="int64")
Contributor Author

Can anyone with a 32-bit platform confirm the behavior on master for .str methods returning int dtype? Is it int32 or int64?

We may have been inconsistent before, and returned int32 for empty, but int64 for non-empty.

Member

I think this is correct; I'm not sure we return a platform int anywhere, save maybe some niche cases.
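
For anyone wanting to check this locally, a minimal sketch along these lines prints the dtypes in question. It assumes a recent pandas install; the variable names are illustrative and not taken from the test suite.

```python
import pandas as pd

# Object-dtype Series: one empty, one with values and no missing data.
empty = pd.Series(dtype=object)
non_empty = pd.Series(["a", "bb", "ccc"], dtype=object)

# .str.len() is one of the string methods that returns integers.
empty_len = empty.str.len()
non_empty_len = non_empty.str.len()

# Compare these on 32-bit and 64-bit builds to see whether the
# empty and non-empty code paths agree on the integer width.
print("empty:    ", empty_len.dtype)
print("non-empty:", non_empty_len.dtype)
```

If the two printed dtypes differ on a 32-bit build, that would confirm the inconsistency suspected above.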

@WillAyd
Member

WillAyd commented Dec 12, 2019

Do we want to optimize the BooleanArray first before returning it from methods like this? I think doing this right now could have non-trivial memory impacts.

@TomAugspurger
Contributor Author

Recall this is just for StringDtype, not object-dtype backed Series. So there's no harm to the current users.

And improving performance later is much easier than breaking API.
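
For a rough sense of the memory question, the sketch below compares a plain NumPy boolean array with the nullable boolean array holding the same values. It assumes a pandas version that exposes the "boolean" extension dtype through pd.array; the array size is arbitrary.

```python
import numpy as np
import pandas as pd

n = 1_000
plain = np.zeros(n, dtype=bool)

# The nullable boolean array keeps a validity mask alongside the values,
# so it is expected to report more bytes than the plain NumPy array.
nullable = pd.array(plain, dtype="boolean")

print("numpy bool:      ", plain.nbytes)
print("nullable boolean:", nullable.nbytes)
```

Object-dtype results that contain missing values store a Python object reference per element, so the comparison against that case may look quite different.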


     mask = isna(arr)

     assert isinstance(arr, StringArray)
     arr = np.asarray(arr)

-    if is_integer_dtype(dtype):
+    if is_integer_dtype(dtype) or is_bool_dtype(dtype):
+        if is_integer_dtype(dtype):
Contributor Author

MyPy apparently doesn't like this...

pandas/core/strings.py:164: error: Incompatible types in assignment (expression has type "Type[BooleanArray]", variable has type "Type[IntegerArray]")

Any suggestions on how to please the type checker?

Member

Yeah, if you assign inside an if...else block, the type is inferred from the first assignment that appears.

So before the block you can just declare constructor: Type[Union[IntegerArray, BooleanArray]] or maybe even something simpler like constructor: Type[ExtensionArray] depending on what is valid

Contributor Author

nvm, I think I got it.
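
The pattern suggested above (declare the variable's type once, before the branches) can be sketched as follows. The two classes here are empty stand-ins so the snippet runs without pandas, and pick_constructor is a hypothetical helper, not a function from pandas/core/strings.py.

```python
from typing import Type, Union


class IntegerArray:
    """Stand-in for pandas' IntegerArray (illustration only)."""


class BooleanArray:
    """Stand-in for pandas' BooleanArray (illustration only)."""


def pick_constructor(want_integer: bool) -> Type[Union[IntegerArray, BooleanArray]]:
    # Declaring the type up front stops the checker from inferring
    # Type[IntegerArray] from the first branch and rejecting the second.
    constructor: Type[Union[IntegerArray, BooleanArray]]
    if want_integer:
        constructor = IntegerArray
    else:
        constructor = BooleanArray
    return constructor


print(pick_constructor(True).__name__)
print(pick_constructor(False).__name__)
```

Without the bare annotation, mypy would be expected to report the same "Incompatible types in assignment" error on the else branch.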


     mask = isna(arr)

     assert isinstance(arr, StringArray)
     arr = np.asarray(arr)

-    if is_integer_dtype(dtype):
+    if is_integer_dtype(dtype) or is_bool_dtype(dtype):
+        constructor: Union[Type[IntegerArray], Type[BooleanArray]]
Member

@WillAyd WillAyd Dec 12, 2019

Suggested change
-        constructor: Union[Type[IntegerArray], Type[BooleanArray]]
+        constructor: Type[Union[IntegerArray, BooleanArray]]

Optional but less verbose if you put the Union inside of the Type

@jreback jreback added API Design, Reshaping, Dtype Conversions, ExtensionArray labels Dec 12, 2019
Member

@WillAyd WillAyd left a comment

lgtm

@WillAyd WillAyd added this to the 1.0 milestone Dec 13, 2019
@@ -74,6 +74,7 @@ These are places where the behavior of ``StringDtype`` objects differ from
 l. For ``StringDtype``, :ref:`string accessor methods<api.series.str>`
    that return **numeric** output will always return a nullable integer dtype,
    rather than either int or float dtype, depending on the presence of NA values.
+   Methods returning **boolean** output will return a nullable boolean dtype.
Contributor

maybe add a doc-link here (can be followup)

Contributor Author

I'm going to hold off on this. I'm planning to restructure the docs for integer / boolean / NA once all these PRs are in.
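
The behavior described in the documentation change above can be illustrated with a short sketch. It assumes a pandas version that includes this change (1.0 or later); the sample values and variable names are made up for the example.

```python
import pandas as pd

values = ["apple", None, "kiwi"]
as_object = pd.Series(values, dtype=object)
as_string = pd.Series(values, dtype=pd.StringDtype())

# StringDtype-backed Series: results use the nullable extension dtypes.
bool_result = as_string.str.startswith("a")
int_result = as_string.str.len()
print("StringDtype startswith:", bool_result.dtype)
print("StringDtype len:       ", int_result.dtype)

# Object-dtype Series: the result dtype depends on whether NA is present.
print("object startswith:     ", as_object.str.startswith("a").dtype)
print("object len:            ", as_object.str.len().dtype)
```

The missing value is carried through as NA in both StringDtype results, so the dtype stays the same whether or not the input has missing entries.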

if is_integer_dtype(dtype):
    constructor = IntegerArray
else:
    constructor = BooleanArray
Contributor

Is there a way to combine the above if/else? Reading this is super confusing.

Contributor Author

I don't see an easy way. dtype.construct_array_type isn't an option, since the functions using na_map pass dtypes like bool and int: they work with both object-dtype arrays (returning NumPy arrays) and StringArray (returning EAs). So we can have either a NumPy dtype or an extension type here.
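
The constraint described above can be sketched roughly like this. The wrap_result helper is hypothetical and much simpler than the real code in pandas/core/strings.py; it only shows why the choice is made by inspecting the dtype rather than by asking the dtype for its array type.

```python
import numpy as np
from pandas.api.types import is_bool_dtype, is_integer_dtype
from pandas.arrays import BooleanArray, IntegerArray


def wrap_result(values, mask, dtype):
    """Hypothetical helper: wrap values and an NA mask in a nullable array."""
    if is_integer_dtype(dtype):
        return IntegerArray(values.astype("int64"), mask)
    if is_bool_dtype(dtype):
        return BooleanArray(values.astype(bool), mask)
    raise TypeError(f"unsupported dtype: {dtype!r}")


# Callers pass plain types such as int or bool. The corresponding NumPy
# dtype has no construct_array_type method to dispatch on.
print(hasattr(np.dtype(int), "construct_array_type"))

mask = np.array([False, True, False])
lengths = wrap_result(np.array([5, 0, 4]), mask, int)
flags = wrap_result(np.array([True, False, False]), mask, bool)
print(lengths.dtype, flags.dtype)
```

The masked positions become NA in both results, regardless of the placeholder value stored underneath.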

@TomAugspurger
Contributor Author

Planning to merge this in a few hours.

@TomAugspurger TomAugspurger merged commit 5b25df2 into pandas-dev:master Dec 19, 2019
@TomAugspurger TomAugspurger deleted the string-str-returns-boolean branch December 19, 2019 17:09
@jorisvandenbossche
Member

I'm going to hold off on this. I'm planning to restructure the docs for integer / boolean / NA once all these PRs are in.

What's your current plan?
In one of the PRs we discussed this a little (gathering them all on a single page vs. additional nesting), but I don't remember where anymore.

AlexKirko pushed a commit to AlexKirko/pandas that referenced this pull request Dec 29, 2019